Using QSARDB to create a melting point model web service
Researchers
Andrew SID Lang and Villu Ruusmann
[This page is a duplicate (backup) of the original model page on the ONSChallenge wiki]
Objective
To create a relatively small (as compared to our other melting point models) Random Forest-based melting point model and distribute it as a web service using the QsarDB open digital repository.
Background
Our goal is to create an Open CC0 melting point model using Open Data and Open Descriptors (CDK), under a transparent, reproducible, open procedure. One recent solution for deploying models (of all types) in the open is the QsarDB open digital repository, developed by Villu Ruusmann. We were successful in a previous analysis, MPModel009, but the resulting QDB archive was too large to deploy. Here we develop a model on a smaller (but highly curated) dataset and perform feature selection to reduce the number of descriptors used, with the goal of creating a good model of reasonable size (less than 10 MB, the current limit of the QsarDB repository).
Procedure
Data Collection and Curation. We began with the doubleplusgood melting point dataset (ONSMP029) of 2706 highly curated, double+ validated (range: 0.1-5 °C) unique compounds that have no chiral centers and no cis/trans isomerism. From this set we removed coronene and octaphenylcyclotetrasiloxane, as they are obvious outliers in the chemical space. For the remaining 2704 compounds, we generated all CDK descriptors except CPSA, IP, WHIM, all protein, and all geometrical descriptors. We then removed HybRatio and Kier3 due to multiple NA entries, and all khs.xxx descriptors with fewer than 27 (1%) non-zero values, leaving 161 descriptors.
Feature Selection. While Random Forest models have no problems with highly correlated variables, using highly correlated variables can skew variable importance measures. We used the caret package for R to remove highly correlated descriptors (BCUTc-1h, apol, naAromAtom, nAromBond, nAtom, ATSc2, ATSc3, ATSm2, ATSp1, ATSp2, ATSp3, ATSp4, ATSp5, nB, C1SP1, SCH-3, VCH-4, VC-5, SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-6, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, VP-7, SPC-5, SPC-6, VPC-5, ECCEN, Kier1, Kier2, VABC, WTPT-1, WPATH, WPOL, Zagreb), found using the following code:
library("caret")
## load in data
mydata = read.csv(file="20120607DoubleValidatedReadyForFeatureSelection.csv",head=TRUE,row.names="molID")
## correlation matrix
cor.mat = cor(mydata)
## find descriptors with correlation r > 0.90
findCorrelation(cor.mat, cutoff = .90, verbose = TRUE)
[output]
7 12 13 14 15 17 18 22 26 27 28 29 30 32 34 43 49 59 61 62 63 64 65 66 67 69 70 71 72 73 74 76 78 79 81 83 121 122 151 153 158 159 161
[output]
This leaves 2704 compounds with 118 descriptors ready for modeling.
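For readers outside R, the same cutoff-based pruning can be sketched in Python. Note this is a simplified greedy analogue — caret's findCorrelation additionally uses a mean-absolute-correlation heuristic to decide which member of a correlated pair to drop — and the data frame here is a toy stand-in for the real descriptor table:

```python
import numpy as np
import pandas as pd

# Toy descriptor table standing in for the 161 CDK descriptors
# (hypothetical data; the real input is the feature-selection CSV above)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "d1": a,
    "d2": a * 2.0 + rng.normal(scale=0.01, size=200),  # near-duplicate of d1
    "d3": rng.normal(size=200),
})

# Flag one column from every pair with |r| > 0.90, scanning the upper
# triangle of the absolute correlation matrix
cor = df.corr().abs()
upper = cor.where(np.triu(np.ones(cor.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.90).any()]
print(to_drop)  # ['d2']
```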
Modeling. A random forest was created and serialized using the following code:
library("randomForest")
mydata = read.csv(file="20120607DoubleValidatedReadyRF.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 39
Mean of squared residuals: 1451.971
% Var explained: 83.34
[output]
## get variable importance plot
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
The RF reports an OOB R2 of 0.83 and an OOB RMSE of 38.1 °C; the resulting image below shows the importance of the descriptors. As in all previous analyses, the image points to the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as the most important physicochemical properties for melting point prediction.
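As a quick sanity check on the quoted numbers, the OOB RMSE follows directly from the mean squared residual printed by randomForest, and "% Var explained" is 1 - MSE/Var(y). A small sketch in Python, with the two values copied from the output above:

```python
import math

# Reported by randomForest: "Mean of squared residuals: 1451.971"
oob_mse = 1451.971

# The OOB RMSE is the square root of the mean squared residual
oob_rmse = math.sqrt(oob_mse)
print(round(oob_rmse, 1))  # 38.1

# "% Var explained: 83.34" means 1 - MSE/Var(y) = 0.8334, i.e. the OOB R2
# of 0.83; inverting it gives the implied variance of the measured mpC values
oob_r2 = 0.8334
implied_var = oob_mse / (1 - oob_r2)
print(round(implied_var))
```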
The model was then saved so that it could be deployed as a web service using the following code:
saveRDS(mydata.rf, file = "ONSMPModel010")
The model is available to use for batch melting point prediction under a CC0 license. The model was then used to predict the melting points of the training set in order to identify possible errors in the dataset and compounds which are difficult to model using current 2D CDK descriptors (such as coronene).
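This outlier hunt amounts to ranking training compounds by absolute residual between measured and predicted melting points. A minimal sketch, with hypothetical compound names and values standing in for the real 2704-compound table:

```python
import numpy as np

# Hypothetical measured vs. predicted melting points (°C) for a few compounds;
# the real analysis used the 2704 training compounds and the RF predictions
names = np.array(["cmpd_a", "cmpd_b", "cmpd_c", "cmpd_d"])
measured = np.array([146.0, 25.0, 210.0, 80.0])
predicted = np.array([60.0, 30.0, 205.0, 78.0])

# Rank compounds by absolute residual to flag likely data errors or
# hard-to-model structures
residuals = np.abs(measured - predicted)
order = np.argsort(residuals)[::-1]
print(list(names[order][:2]))  # ['cmpd_a', 'cmpd_c']
```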
Plotting the predicted versus measured melting point values using Tableau Public, we see that the melting point of compounds tends to increase with larger TopoPSA (colour) and nHBDon (size), with the top outliers being: cyanic iodide, 2,6-dimethoxy-p-benzoquinone, 2-methyl-4-nitro-1H-imidazole, isophthalic acid, 2,2,3,3-tetramethylbutane, 2-(1,3-thiazol-4-yl)-1H-benzimidazole, 2-mercaptobenzimidazole, p-quaterphenyl, 4,4'-dihydroxybiphenyl, 2,4-hexadiyne.
QDB Format Archive. A parallel QDB format archive was created using the same data and the following code in a CMD window:
The Random Forest for the QDB archive was then created in R using:
suppressMessages(library("randomForest"))
qdbDir = "C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010"
propertyId = 'mpC'
descriptorIdList = c('ALogP', 'ALogp2', 'AMR', 'BCUTw-1l', 'BCUTw-1h', 'BCUTc-1l', 'BCUTp-1l', 'BCUTp-1h',
  'fragC', 'nAcid', 'ATSc1', 'ATSc4', 'ATSc5', 'ATSm1', 'ATSm3', 'ATSm4', 'ATSm5', 'nBase', 'bpol',
  'C2SP1', 'C1SP2', 'C2SP2', 'C3SP2', 'C1SP3', 'C2SP3', 'C3SP3', 'C4SP3', 'SCH-4', 'SCH-5', 'SCH-6',
  'SCH-7', 'VCH-3', 'VCH-5', 'VCH-6', 'VCH-7', 'SC-3', 'SC-4', 'SC-5', 'SC-6', 'VC-3', 'VC-4', 'VC-6',
  'SP-7', 'VP-6', 'SPC-4', 'VPC-4', 'VPC-6', 'FMF', 'nHBDon', 'nHBAcc', 'khs.sCH3', 'khs.dCH2',
  'khs.ssCH2', 'khs.dsCH', 'khs.aaCH', 'khs.sssCH', 'khs.tsC', 'khs.dssC', 'khs.aasC', 'khs.aaaC',
  'khs.ssssC', 'khs.sNH2', 'khs.ssNH', 'khs.aaNH', 'khs.tN', 'khs.dsN', 'khs.aaN', 'khs.sssN',
  'khs.ddsN', 'khs.aasN', 'khs.sOH', 'khs.dO', 'khs.ssO', 'khs.aaO', 'khs.sF', 'khs.ssssSi', 'khs.sSH',
  'khs.dS', 'khs.ssS', 'khs.aaS', 'khs.ddssS', 'khs.sCl', 'khs.sBr', 'khs.sI', 'nAtomLC', 'nAtomP',
  'LipinskiFailures', 'nAtomLAC', 'MLogP', 'MDEC-11', 'MDEC-12', 'MDEC-13', 'MDEC-14', 'MDEC-22',
  'MDEC-23', 'MDEC-24', 'MDEC-33', 'MDEC-34', 'MDEC-44', 'MDEO-11', 'MDEO-12', 'MDEO-22', 'MDEN-11',
  'MDEN-12', 'MDEN-13', 'MDEN-22', 'MDEN-23', 'MDEN-33', 'PetitjeanNumber', 'nRotB', 'TopoPSA',
  'VAdjMat', 'MW', 'WTPT-2', 'WTPT-3', 'WTPT-4', 'WTPT-5', 'XLogP')
## read tab-separated value files from the QDB directory layout
loadValues = function(path, id){
  result = read.table(path, header = TRUE, sep = "\t", na.strings = "N/A")
  result = na.omit(result)
  names(result) = c('Id', gsub("-", "_", x = id))
  return (result)
}
loadPropertyValues = function(id){
  return (loadValues(paste(sep = "/", qdbDir, "properties", id, "values"), id))
}
loadDescriptorValues = function(id){
  return (loadValues(paste(sep = "/", qdbDir, "descriptors", id, "values"), id))
}
## assemble the modeling table: property values joined with each descriptor by compound Id
rfdata = loadPropertyValues(propertyId)
for(descriptorId in descriptorIdList){
  print(descriptorId)
  rfdata = merge(rfdata, loadDescriptorValues(descriptorId), by = 'Id')
}
compoundIds = rfdata$Id
rfdata$Id = NULL
rfmodel = randomForest(formula = mpC ~ ., data = rfdata)
print(rfmodel)
## wrap the model with its metadata and an evaluate() method for the web service
object = list()
object$propertyId = propertyId
object$getPropertyId = function(self){ return (self$propertyId) }
object$descriptorIdList = descriptorIdList
object$getDescriptorIdList = function(self){ return (self$descriptorIdList) }
object$rfmodel = rfmodel
object$evaluate = function(self, values){
  suppressMessages(require("randomForest"))
  descriptorIdList = self$getDescriptorIdList(self)
  descriptorIdList = sapply(descriptorIdList, function(x) gsub("-", "_", x))
  newrfdata = data.frame(c = NA)
  for(i in 1:length(descriptorIdList)){
    newrfdata[descriptorIdList[i]] = values[i]
  }
  return (predict(self$rfmodel, newdata = newrfdata))
}
saveRDS(file = paste(sep = "/", qdbDir, "models/rf/rds"), object)
## write training-set predictions into the archive
rfvalues = predict(rfmodel, rfdata)
predictedValues = data.frame(compoundIds, rfvalues)
write.table(predictedValues, file = paste(sep = "/", qdbDir, "predictions/rf-training/values"),
  col.names = c("csid", "mpC"), row.names = FALSE, quote = FALSE, sep = "\t")
The model was then zipped and tested, and the QDB archive was deployed for use as a webservice to the QsarDB Open Digital Repository.
Results
An accurate (R2 = 0.83) melting point model using Open descriptors and Open data was developed and deployed on the QsarDB Open Digital Repository, where it can be used as a webservice.